Fake News Detection Project Using Machine Learning

Project information

  • Category: NLP, Classification
  • Source Data: Download

Project Details

What is fake news?

Fake news consists of news items that may be hoaxes and is generally spread through social media and other online media. This is often done to further or impose certain ideas, frequently in pursuit of political agendas. Such news items may contain false and/or exaggerated claims, may be amplified (made viral) by recommendation algorithms, and users may end up in a filter bubble.

What is a TfidfVectorizer?

TF-IDF stands for term frequency-inverse document frequency. It is a measurement widely used in information retrieval (IR) and machine learning that quantifies how important or relevant a word is to a document within a collection of documents. A small worked example follows the two definitions below.

• TF (Term Frequency): The number of times a word appears in a document is its term frequency. A higher value means the term appears more often, so the document is a good match when the term is part of the search terms.

• IDF (Inverse Document Frequency): Words that occur many times in a document, but also occur many times in many other documents, may be irrelevant. IDF measures how significant a term is in the entire corpus.
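
As a small sketch of how these two quantities combine in practice (using a toy corpus for illustration, not the project data), scikit-learn's TfidfVectorizer computes one TF-IDF score per term and document:

    # Toy TF-IDF example on a three-document corpus (illustrative only)
    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        'the news is fake',
        'the news is real',
        'fake news spreads fast',
    ]
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(corpus)

    # Rows are documents, columns are terms; higher scores mark terms that
    # are frequent in one document but rare across the corpus.
    print(vectorizer.get_feature_names_out())
    print(tfidf.toarray().round(2))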

As the dataset has a labeled target variable, this is a supervised machine learning problem. Supervised machine learning is a type of machine learning in which models are trained on well-labeled training data. Labeled data means that the training data is already tagged with the correct output.

In this project, I will solve the problem with the PassiveAggressiveClassifier, logistic regression, and a linear SVC, and compare their accuracy. All three are supervised algorithms that rely on labeled training data to learn and improve their accuracy over time. Once these learning algorithms are fine-tuned for accuracy, they are powerful classification tools. A small sketch of the online-learning behavior follows.
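
The PassiveAggressiveClassifier in particular is an online learner: it can keep updating as new labeled examples arrive, which is what "improving accuracy over time" looks like in practice. A minimal sketch with stand-in synthetic data (not the project data):

    # Incremental updates with PassiveAggressiveClassifier via partial_fit
    import numpy as np
    from sklearn.linear_model import PassiveAggressiveClassifier

    rng = np.random.default_rng(0)
    X_stream = rng.normal(size=(100, 5))          # stand-in features
    y_stream = (X_stream[:, 0] > 0).astype(int)   # stand-in labels

    clf = PassiveAggressiveClassifier()
    for start in range(0, 100, 20):               # feed data in 5 batches
        batch = slice(start, start + 20)
        clf.partial_fit(X_stream[batch], y_stream[batch], classes=[0, 1])
    print(clf.score(X_stream, y_stream))          # accuracy on the stream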

Python Packages:

  • 1. Pandas
  • 2. Sklearn

Roadmap:

  • 1. Load the data
  • 2. Split a dataset into training and testing datasets
  • 3. Train the model
  • 4. Model Evaluation
  • Step 1: Import packages
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import PassiveAggressiveClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score, confusion_matrix
  • First, I imported the necessary packages for this project.
  • Pandas will be used to load data from various sources such as local or cloud storage, databases, Excel files, CSV files, and so on.
  • train_test_split will be used to split the dataset into training and testing sets.
  • TfidfVectorizer will be used to convert a collection of raw documents into a matrix of TF-IDF features.
  • PassiveAggressiveClassifier, LogisticRegression, and LinearSVC will be used to classify the articles.
  • accuracy_score will be used to assess accuracy.
  • confusion_matrix will be used to summarize the predictions for each class.
  • #Read the data
    file_path=r'C:\Users\shang\Desktop\ML\data\news.csv'
    df=pd.read_csv(file_path)
    print(df.head())
  • I define the path variable and then use pd.read_csv to read the data.
  • df.head() prints the first five rows of the dataset.
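  • As a quick addition of my own (not in the original walkthrough), it is worth checking the dataset size and class balance before modeling; this assumes the label column is named 'label', as used in the next snippet:
  • # Sanity check: dataset size and class balance (illustrative addition)
    print(df.shape)
    print(df['label'].value_counts())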
  • X=df['text']
    Y=df.label
    print(Y.head())
  • X is the feature data: the text of each news article, i.e., the input used to train the model. This text will later be converted into numeric (float) TF-IDF features.
  • Y is the label column containing the target values (FAKE or REAL).
  • Step 2: Split the dataset into training and testing datasets
    # Split the dataset
    X_train,X_test,y_train,y_test=train_test_split(X, Y, test_size=0.2, random_state=42)
  • I split the whole dataset into two groups, a training dataset and a testing dataset, using train_test_split with an 80/20 split. I will use the testing dataset to check the accuracy of the model later; an optional stratified variant is sketched below.
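  • As an optional variant (my addition, not the project's original code), passing stratify=Y makes the 80/20 split preserve the FAKE/REAL class ratio in both subsets:
  • # Optional: stratified version of the same split
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.2, random_state=42, stratify=Y)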
  • Step 3: Train the model
    # Initialize a TfidfVectorizer
    tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)
    # Fit and transform train set, transform test set
    converted_train=tfidf_vectorizer.fit_transform(X_train)
    converted_test=tfidf_vectorizer.transform(X_test)
  • Initialize a TfidfVectorizer with stop words from the English language and a maximum document frequency of 0.7 (terms with a higher document frequency will be discarded).
  • Stop words are the most common words in a language and are filtered out before processing natural-language data; a peek at the built-in list is shown after these notes.
  • The TfidfVectorizer turns a collection of raw documents into a matrix of TF-IDF features.
  • Fit and transform the vectorizer on the train set, then use the fitted vectorizer to transform the test set; fitting on the training data only avoids leaking test-set vocabulary into the features.
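  • For a concrete look at what gets filtered, scikit-learn exposes the built-in English stop word list it applies; this snippet is a small illustrative addition:
  • # Peek at the list applied by stop_words='english'
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
    print(len(ENGLISH_STOP_WORDS))          # size of the built-in list
    print(sorted(ENGLISH_STOP_WORDS)[:10])  # a few entries, alphabetically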
  • # Train different algorithms and compare them using accuracy
    models = []
    models.append(('LogisticRegression', LogisticRegression()))
    models.append(('LinearSVC', LinearSVC()))
    models.append(('PassiveAggressiveClassifier', PassiveAggressiveClassifier()))
  • I trained three models using the following algorithms: Logistic Regression, Linear SVC, and the Passive Aggressive Classifier. I chose them because they are popular and tend to perform well on sparse text features, but you can experiment with other algorithms too; a cross-validation variant of the comparison is sketched below.
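  • Because a single train/test split can be noisy, an optional sketch (my addition) is to compare the same models with 5-fold cross-validation on the TF-IDF training matrix:
  • # Optional: 5-fold cross-validation comparison of the three models
    from sklearn.model_selection import cross_val_score
    for name, model in models:
        scores = cross_val_score(model, converted_train, y_train,
                                 cv=5, scoring='accuracy')
        print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')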
  • Step 4: Model Evaluation
    # Evaluate each model in turn
    for name, model in models:
        model = model.fit(converted_train, y_train)
        pred = model.predict(converted_test)
        score = accuracy_score(y_test, pred)
        # Build the confusion matrix with a fixed label order
        matrix = confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
        print(f'{name} accuracy: {round(score*100, 2)}%')
        print(f'{name} confusion matrix:')
        print(matrix)
  • I calculated the accuracy with accuracy_score() from sklearn.metrics.
  • I printed out a confusion matrix to gain insight into the number of false and true negatives and positives.
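  • To read the matrix (treating FAKE as the positive class): with labels=['FAKE','REAL'], rows are the true classes and columns the predicted classes, so the last model's matrix unpacks as in this small sketch:
  • # Unpack the confusion matrix (FAKE taken as the positive class)
    tp, fn, fp, tn = matrix.ravel()
    print(f'FAKE caught: {tp}, FAKE missed: {fn}, '
          f'REAL flagged FAKE: {fp}, REAL correct: {tn}')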
  • By comparing the accuracy of all the methods, I conclude that TF-IDF with the PassiveAggressiveClassifier gave the best performance (94.71%) on the test dataset.